Benchmarking Attribute Selection Techniques for Discrete Class Data Mining

نویسندگان

  • Mark A. Hall
  • Geoff Holmes
چکیده

Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation. Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted. This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful devise for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes C4.5 and naive Bayes.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A novel feature selection techniques based on contrast set mining

Data classification is a challenging task in era of big data due to high number of features. Feature selection is a step in process of knowledge discovery in data that aims to reduce dimensionality and improve the classification performance. The purpose of this research is to define new techniques for feature selection in order to improve classification accuracy and reduce the time required for...

متن کامل

A Novel Tree Based Classification

Classification is a data mining (DM) technique used to predict or forecast the unknown information using the historical data. There are many classification techniques. ID3 is a very popular tree based classification algorithm for a categorical data which does not support continuous data. Attribute selection process plays major role in building a classification tree model. Attribute Selection in...

متن کامل

Bayesian Models to Assess Risk of Corruption of Federal Management Units

This paper presents a data mining project that generated Bayesian models to assess risk of corruption of federal management units. With thousands of extracted features related to corruptibility, the data were processed using techniques like correlation analysis and variance per class. We also compared two different discretization methods: Minimum Description Length Principle (MDLP) and Class-At...

متن کامل

Benchmarking Relief-Based Feature Selection Methods

Modern data mining requires feature selection methods that can (1) be applied to large scale feature spaces, (2) function in noisy problems, (3) detect complex patterns of association (e.g. interactions), (4) be flexibly adapted to various problem domains and data types, and (5) are computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms ins...

متن کامل

Handling Sparse Data Sets by Applying Contrast Set Mining in Feature Selection

A data set is sparse if the number of samples in a data set is not sufficient to model the data accurately. Recent research emphasized interest in applying data mining and feature selection techniques to real world problems, many of which are characterized as sparse data sets. The purpose of this research is to define new techniques for feature selection in order to improve classification accur...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • IEEE Trans. Knowl. Data Eng.

دوره 15  شماره 

صفحات  -

تاریخ انتشار 2003